The stock market has consistently proven to be a good place to invest and save for the future. There are several compelling reasons to invest in stocks: doing so can help fight inflation, build wealth, and provide certain tax benefits. Steady returns compounded over a long period can grow far beyond what seems possible at the outset, and thanks to the power of compounding, the earlier one starts investing, the larger the corpus available at retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximize earnings under any market condition. A diversified portfolio tends to yield higher returns at lower risk by tempering potential losses when the market is down. It is easy to get lost in the sea of financial metrics used to assess a stock's worth, and repeating that analysis across a multitude of stocks to identify the right picks for an individual can be a tedious task. Cluster analysis can identify stocks that exhibit similar characteristics as well as stocks with minimal correlation to one another. This helps investors compare stocks across different market segments and guard against risks that could leave the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm that provides its customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
Data Dictionary

Ticker Symbol: An abbreviation used to uniquely identify publicly traded shares of a particular stock on a particular stock market
Company: Name of the company
GICS Sector: The economic sector assigned to a company by the Global Industry Classification Standard (GICS) that best defines its business operations
GICS Sub Industry: The sub-industry group assigned to a company by GICS that best defines its business operations
Current Price: Current stock price (in dollars)
Price Change: Percentage change in the stock price over 13 weeks
Volatility: Standard deviation of the stock price over the past 13 weeks
ROE: A measure of financial performance calculated by dividing net income by shareholders' equity (shareholders' equity is a company's assets minus its debt)
Cash Ratio: The ratio of a company's total reserves of cash and cash equivalents to its total current liabilities
Net Cash Flow: The difference between a company's cash inflows and outflows (in dollars)
Net Income: Revenues minus expenses, interest, and taxes (in dollars)
Earnings Per Share: The company's net profit divided by the number of common shares it has outstanding (in dollars)
Estimated Shares Outstanding: The estimated number of the company's shares currently held by all its shareholders
P/E Ratio: Ratio of the company's current stock price to its earnings per share
P/B Ratio: Ratio of the company's stock price per share to its book value per share (book value is the net difference between a company's total assets and total liabilities)
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style='darkgrid')
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from scipy.stats import zscore
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Dimensionality reduction
from sklearn.decomposition import PCA
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to suppress warnings
import warnings
warnings.filterwarnings("ignore")
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
data = pd.read_csv("/content/drive/MyDrive/AIML/Unsupervised_Learning/stock_data.csv")
df = data.copy()
First and last 5 rows of the dataset
# looking at head (first 5 observations)
df.head()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
# looking at tail (last 5 observations)
df.tail()
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 335 | YHOO | Yahoo Inc. | Information Technology | Internet Software & Services | 33.259998 | 14.887727 | 1.845149 | 15 | 459 | -1032187000 | -4359082000 | -4.64 | 939457327.6 | 28.976191 | 6.261775 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 435353535.4 | 17.682214 | -3.838260 |
| 337 | ZBH | Zimmer Biomet Holdings | Health Care | Health Care Equipment | 102.589996 | 9.347683 | 1.404206 | 1 | 100 | 376000000 | 147000000 | 0.78 | 188461538.5 | 131.525636 | -23.884449 |
| 338 | ZION | Zions Bancorp | Financials | Regional Banks | 27.299999 | -1.158588 | 1.468176 | 4 | 99 | -43623000 | 309471000 | 1.20 | 257892500.0 | 22.749999 | -0.063096 |
| 339 | ZTS | Zoetis | Health Care | Pharmaceuticals | 47.919998 | 16.678836 | 1.610285 | 32 | 65 | 272000000 | 339000000 | 0.68 | 498529411.8 | 70.470585 | 1.723068 |
#Checking the shape of the dataset
df.shape
(340, 15)
#Checking the data types of the columns for the dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null object 1 Security 340 non-null object 2 GICS Sector 340 non-null object 3 GICS Sub Industry 340 non-null object 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: float64(7), int64(4), object(4) memory usage: 40.0+ KB
df.duplicated().sum()
0
df.isnull().sum()
| 0 | |
|---|---|
| Ticker Symbol | 0 |
| Security | 0 |
| GICS Sector | 0 |
| GICS Sub Industry | 0 |
| Current Price | 0 |
| Price Change | 0 |
| Volatility | 0 |
| ROE | 0 |
| Cash Ratio | 0 |
| Net Cash Flow | 0 |
| Net Income | 0 |
| Earnings Per Share | 0 |
| Estimated Shares Outstanding | 0 |
| P/E Ratio | 0 |
| P/B Ratio | 0 |
df.describe().T
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| Current Price | 340.0 | 8.086234e+01 | 9.805509e+01 | 4.500000e+00 | 3.855500e+01 | 5.970500e+01 | 9.288000e+01 | 1.274950e+03 |
| Price Change | 340.0 | 4.078194e+00 | 1.200634e+01 | -4.712969e+01 | -9.394838e-01 | 4.819505e+00 | 1.069549e+01 | 5.505168e+01 |
| Volatility | 340.0 | 1.525976e+00 | 5.917984e-01 | 7.331632e-01 | 1.134878e+00 | 1.385593e+00 | 1.695549e+00 | 4.580042e+00 |
| ROE | 340.0 | 3.959706e+01 | 9.654754e+01 | 1.000000e+00 | 9.750000e+00 | 1.500000e+01 | 2.700000e+01 | 9.170000e+02 |
| Cash Ratio | 340.0 | 7.002353e+01 | 9.042133e+01 | 0.000000e+00 | 1.800000e+01 | 4.700000e+01 | 9.900000e+01 | 9.580000e+02 |
| Net Cash Flow | 340.0 | 5.553762e+07 | 1.946365e+09 | -1.120800e+10 | -1.939065e+08 | 2.098000e+06 | 1.698108e+08 | 2.076400e+10 |
| Net Income | 340.0 | 1.494385e+09 | 3.940150e+09 | -2.352800e+10 | 3.523012e+08 | 7.073360e+08 | 1.899000e+09 | 2.444200e+10 |
| Earnings Per Share | 340.0 | 2.776662e+00 | 6.587779e+00 | -6.120000e+01 | 1.557500e+00 | 2.895000e+00 | 4.620000e+00 | 5.009000e+01 |
| Estimated Shares Outstanding | 340.0 | 5.770283e+08 | 8.458496e+08 | 2.767216e+07 | 1.588482e+08 | 3.096751e+08 | 5.731175e+08 | 6.159292e+09 |
| P/E Ratio | 340.0 | 3.261256e+01 | 4.434873e+01 | 2.935451e+00 | 1.504465e+01 | 2.081988e+01 | 3.176476e+01 | 5.280391e+02 |
| P/B Ratio | 340.0 | -1.718249e+00 | 1.396691e+01 | -7.611908e+01 | -4.352056e+00 | -1.067170e+00 | 3.917066e+00 | 1.290646e+02 |
# provide a statistical summary of all categorical (object-type) columns
df.describe(include='object').T
| count | unique | top | freq | |
|---|---|---|---|---|
| Ticker Symbol | 340 | 340 | AAL | 1 |
| Security | 340 | 340 | 3M Company | 1 |
| GICS Sector | 340 | 11 | Industrials | 53 |
| GICS Sub Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 |
Observations:
# Convert all object type columns to category
df = df.apply(lambda col : col.astype("category") if col.dtype == "object"
else col)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null category 1 Security 340 non-null category 2 GICS Sector 340 non-null category 3 GICS Sub Industry 340 non-null category 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: category(4), float64(7), int64(4) memory usage: 58.1 KB
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=12)
ax = sns.countplot(
data=data,
x=feature,
palette="viridis",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot a boxplot and a histogram along the same scale
def histogram_boxplot(data, feature, figsize=(16, 6), kde=False, bins=None):
    """
    Combines boxplot and histogram
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (16,6))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True,
) # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(
            data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins,
        )  # For histogram
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
Observation on GICS Sector
#create labeled barplot of stocks by sector
labeled_barplot(df, 'GICS Sector')
#display the five sectors with the most number of stocks
df["GICS Sector"].value_counts(normalize = True).sort_values(ascending = False).head(n=5)
| proportion | |
|---|---|
| GICS Sector | |
| Industrials | 0.155882 |
| Financials | 0.144118 |
| Consumer Discretionary | 0.117647 |
| Health Care | 0.117647 |
| Information Technology | 0.097059 |
GICS Sub Industry
#create labeled barplot of stocks by sub industry
labeled_barplot(df, 'GICS Sub Industry')
#display the five sub industries with the most number of stocks
df['GICS Sub Industry'].value_counts().head(n=5)
| count | |
|---|---|
| GICS Sub Industry | |
| Oil & Gas Exploration & Production | 16 |
| REITs | 14 |
| Industrial Conglomerates | 14 |
| Internet Software & Services | 12 |
| Electric Utilities | 12 |
Numerical Columns
#create list of columns with numerical variables
num_col = df.select_dtypes(include=np.number).columns.tolist()
#display histograms and boxplots for all numerical columns
for col in num_col:
histogram_boxplot(df, col)
Current price
Price change
Volatility
Cash Ratio / ROE
Net Income / EPS
Estimated shares outstanding
P/E and P/B Ratio
Conclusion
1. What does the distribution of stock prices look like?
# Histogram
plt.figure(figsize=(10, 4))
sns.histplot(df['Current Price'], kde=True)
plt.title('Histogram of Stock Prices')
plt.xlabel('Current Price ($)')
plt.ylabel('Frequency')
plt.show()
# Boxplot
plt.figure(figsize=(10, 4))
sns.boxplot(x=df['Current Price'])
plt.title('Boxplot of Stock Prices')
plt.xlabel('Current Price ($)')
plt.show()
Observations:
The distribution of stock prices is positively skewed with most of the stock prices concentrated at the lower end of the price range and a long tail stretching towards higher prices, indicating that there are a few stocks with very high prices compared to the majority.
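The visual impression can be checked numerically: a skewness well above zero confirms the long right tail. A minimal sketch on synthetic right-skewed data (the real notebook would simply call `df['Current Price'].skew()`):

```python
import numpy as np
import pandas as pd

# synthetic right-skewed "prices" standing in for the Current Price column
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=4, sigma=0.8, size=340))

# pandas reports sample skewness; values well above 0 indicate a long right tail
print(round(prices.skew(), 2))
```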
2. The stocks of which economic sector has seen the maximum price increase on average?
# Group by 'GICS Sector' and calculate the mean 'Price Change'
avg_price_increase_by_sector = df.groupby('GICS Sector')['Price Change'].mean().reset_index()
# Sort the results
avg_price_increase_by_sector_sorted = avg_price_increase_by_sector.sort_values('Price Change', ascending=False)
# Bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Price Change', y='GICS Sector', data=avg_price_increase_by_sector_sorted, palette='Blues_r')
plt.title('Average Price Increase by GICS Sector')
plt.xlabel('Average Price Increase (%)')
plt.ylabel('GICS Sector')
plt.show()
Observations:
The bar chart indicates that on average, the Information Technology sector has experienced the highest price increase among all GICS sectors, while the Energy sector has seen a price decrease. Other sectors like Health Care, Consumer Discretionary, and Materials also show notable average price increases.
3. How are the different variables correlated with each other?
#create correlation heat map for numerical variables
plt.figure(figsize=(14, 7))
sns.heatmap(
df[num_col].corr(),
annot=True,
vmin=-1,
vmax=1,
fmt=".2f",
cmap='Blues'
)
plt.show()
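Beyond eyeballing the heatmap, the strongest relationships can be listed numerically. A minimal sketch on toy data (the `top_correlations` helper is a hypothetical convenience, not part of the notebook):

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=5):
    """Return the n strongest pairwise correlations (by absolute value)."""
    corr = df.corr().abs()
    # keep only the upper triangle to drop the diagonal and duplicate pairs
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    return corr.where(mask).stack().sort_values(ascending=False).head(n)

# toy frame with one strongly related pair (a, b) and one unrelated column c
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({"a": x,
                    "b": 2 * x + rng.normal(scale=0.1, size=200),
                    "c": rng.normal(size=200)})
print(top_correlations(toy, n=2))
```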
Observations:
4. Cash ratio measures a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
# Calculate the mean 'Cash Ratio' for each 'GICS Sector'
avg_cash_ratio_by_sector = df.groupby('GICS Sector')['Cash Ratio'].mean().reset_index()
# Bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='Cash Ratio', y='GICS Sector', data=avg_cash_ratio_by_sector, palette='YlOrRd_r')
plt.title('Average Cash Ratio by GICS Sector')
plt.xlabel('Average Cash Ratio')
plt.ylabel('GICS Sector')
plt.show()
Observations:
The IT and Telecom Services sectors have the highest average cash ratios, indicating stronger short-term financial stability.
5. P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?
# Calculate the mean 'P/E Ratio' for each 'GICS Sector'
avg_pe_ratio_by_sector = df.groupby('GICS Sector')['P/E Ratio'].mean().reset_index()
# Bar plot
plt.figure(figsize=(10, 6))
sns.barplot(x='P/E Ratio', y='GICS Sector', data=avg_pe_ratio_by_sector, palette='Oranges_r')
plt.title('Average P/E Ratio by GICS Sector')
plt.xlabel('Average P/E Ratio')
plt.ylabel('GICS Sector')
plt.show()
Observations:
A longer bar indicates a higher average P/E ratio. Energy shows the longest bar, suggesting that, on average, stocks in that sector are priced higher relative to their earnings.
Functions for Numerical and Categorical Feature Visualisation
def hist_and_boxplot(data, variable, figsize=(12, 4), kde=False, bins=None):
"""
Creates a plot with both a histogram and boxplot for a specified numerical variable.
Args:
- data: The DataFrame containing the data.
- variable: The name of the numerical variable (feature) to be plotted.
- figsize: A tuple representing the size of the figure.
    - kde: A boolean indicating whether to overlay a density curve on the histogram.
- bins: An integer representing the number of bins for the histogram, or None for automatic bin size.
Returns:
None
"""
# Set up the matplotlib figure with two rows and one column
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=figsize, sharex=True, gridspec_kw={'height_ratios': [0.2, 0.8]})
# Plot the boxplot on the first row
sns.boxplot(x=variable, data=data, ax=ax1, showmeans=True, color="lightblue")
ax1.set(xlabel='', title=f'Boxplot and Distribution of {variable}')
# Plot the histogram on the second row
if bins:
sns.histplot(data[variable], kde=kde, bins=bins, ax=ax2, color="lightblue")
else:
sns.histplot(data[variable], kde=kde, ax=ax2, color="lightblue")
# Draw lines for mean and median
mean_val = data[variable].mean()
median_val = data[variable].median()
ax2.axvline(mean_val, color='green', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.2f}')
ax2.axvline(median_val, color='black', linestyle='-', linewidth=2, label=f'Median: {median_val:.2f}')
# Add legend to the histogram
ax2.legend()
plt.show()
def custom_barplot(data, feature, perc=False, top_n=None):
"""
Creates a bar plot with labels for each bar from the dataframe data.
Args:
- data: The DataFrame containing the data.
- feature: The column name to plot.
- perc: A boolean to indicate whether to display the percentage (True) or count (False).
- top_n: An integer to display only the top N categories, or None to display all.
Returns:
None
"""
total_entries = len(data[feature]) # The total number of entries
category_counts = data[feature].value_counts()
# Determine the figure size based on the number of categories
fig_width = top_n + 2 if top_n is not None else category_counts.size + 2
plt.figure(figsize=(fig_width, 6))
# Create the bar plot
category_order = category_counts.index[:top_n] if top_n is not None else category_counts.index
barplot = sns.countplot(x=feature, data=data, order=category_order, palette='Blues_r')
# Add labels to each bar
for bar in barplot.patches:
label_x_pos = bar.get_x() + bar.get_width() / 2
if perc:
label = f"{100 * bar.get_height() / total_entries:.1f}%"
else:
label = f"{bar.get_height()}"
barplot.annotate(
label,
(label_x_pos, bar.get_height()),
ha='center',
va='bottom',
fontsize=11,
textcoords='offset points',
xytext=(0, 0)
)
# Rotate x-axis labels if there are many categories
if category_counts.size > 10:
plt.xticks(rotation=45, ha='right')
# Adding a title
plt.title(f'Distribution of {feature}', fontsize=15)
plt.show()
Numerical Data
# Get a list of the numerical variables
num_var = data.select_dtypes(include=['int64','float64']).columns
print(num_var)
Index(['Current Price', 'Price Change', 'Volatility', 'ROE', 'Cash Ratio',
'Net Cash Flow', 'Net Income', 'Earnings Per Share',
'Estimated Shares Outstanding', 'P/E Ratio', 'P/B Ratio'],
dtype='object')
# Plot all the numerical variables in the data DataFrame
numerical_variables = [
'Current Price',
'Price Change',
'Volatility',
'ROE',
'Cash Ratio',
'Net Cash Flow',
'Net Income',
'Earnings Per Share',
'Estimated Shares Outstanding',
'P/E Ratio',
'P/B Ratio'
]
# Loop through the list and create a histogram and boxplot for each variable
for variable in numerical_variables:
hist_and_boxplot(data, variable)
Observations:
Categorical Data
# Create a barplot for 'GICS Sector'
custom_barplot(data=df, feature='GICS Sector', perc=True, top_n=None)
Observations:
Industrials and Financials sectors lead in distribution, while Telecommunications Services is the least represented in the dataset.
The EDA reveals a diverse and dynamic stock market landscape. Key takeaways include:
Based on the EDA, here are some recommendations:
# Drop Ticker Symbol since it is a unique identifier (same as 'Security')
df.drop(['Ticker Symbol'], axis=1, inplace=True)
df.columns
Index(['Security', 'GICS Sector', 'GICS Sub Industry', 'Current Price',
'Price Change', 'Volatility', 'ROE', 'Cash Ratio', 'Net Cash Flow',
'Net Income', 'Earnings Per Share', 'Estimated Shares Outstanding',
'P/E Ratio', 'P/B Ratio'],
dtype='object')
# Drop categorical variables before scaling
df_num = df.select_dtypes(include=[np.number])
# Apply the z score to each column
df_scaled=df_num.apply(zscore)
# Check the scaled dataframe
df_scaled.head()
| Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 |
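A quick sanity check on the scaling step: after `apply(zscore)` every column should have mean 0 and (population) standard deviation 1, regardless of its original scale. A small sketch on toy data:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# two toy columns on wildly different scales, as in the stock data
rng = np.random.default_rng(0)
toy = pd.DataFrame({"price": rng.normal(80, 98, 340),
                    "net_income": rng.normal(1.5e9, 3.9e9, 340)})
toy_scaled = toy.apply(zscore)

# zscore standardizes with ddof=0, so the population std is exactly 1
print(toy_scaled.mean().round(6).tolist())
print(toy_scaled.std(ddof=0).round(6).tolist())
```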
sns.pairplot(df_scaled, height=2, aspect=2, diag_kind='kde')
plt.show()
Observation:
# Calculate the covariance matrix (captures the variance and covariance between features)
covMatrix = np.cov(df_scaled,rowvar=False)
print(covMatrix)
[[ 1.00294985e+00 1.35379678e-01 -1.24623162e-01 -5.51076941e-04 1.28193172e-01 -2.20255018e-02 3.67332609e-02 4.81018703e-01 -1.45735035e-01 2.61040375e-01 1.25001740e-01] [ 1.35379678e-01 1.00294985e+00 -4.09485757e-01 -4.34381685e-02 1.69082964e-01 2.68245310e-02 1.79826528e-01 1.75918225e-01 3.37550438e-02 -2.49801709e-03 3.44304592e-02] [-1.24623162e-01 -4.09485757e-01 1.00294985e+00 1.63011319e-01 2.06659452e-02 -1.42137739e-02 -3.84563951e-01 -3.80427812e-01 -9.59867699e-02 2.64377469e-01 4.66653027e-02] [-5.51076941e-04 -4.34381685e-02 1.63011319e-01 1.00294985e+00 -6.93260015e-02 -5.23589005e-02 -2.89375193e-01 -4.06486472e-01 -3.03773642e-02 2.34095558e-02 -6.34048877e-02] [ 1.28193172e-01 1.69082964e-01 2.06659452e-02 -6.93260015e-02 1.00294985e+00 1.14154326e-01 2.86730699e-02 2.48317919e-02 1.36269927e-01 8.97470272e-02 2.30349368e-01] [-2.20255018e-02 2.68245310e-02 -1.42137739e-02 -5.23589005e-02 1.14154326e-01 1.00294985e+00 4.43591948e-02 1.94047846e-02 -5.15439469e-02 2.69553828e-02 5.72075893e-02] [ 3.67332609e-02 1.79826528e-01 -3.84563951e-01 -2.89375193e-01 2.86730699e-02 4.43591948e-02 1.00294985e+00 5.59397718e-01 5.91048859e-01 -2.24683029e-01 -6.17250714e-02] [ 4.81018703e-01 1.75918225e-01 -3.80427812e-01 -4.06486472e-01 2.48317919e-02 1.94047846e-02 5.59397718e-01 1.00294985e+00 -2.69850312e-02 -2.61428450e-01 1.49589029e-02] [-1.45735035e-01 3.37550438e-02 -9.59867699e-02 -3.03773642e-02 1.36269927e-01 -5.15439469e-02 5.91048859e-01 -2.69850312e-02 1.00294985e+00 -1.45943101e-02 -2.82382975e-02] [ 2.61040375e-01 -2.49801709e-03 2.64377469e-01 2.34095558e-02 8.97470272e-02 2.69553828e-02 -2.24683029e-01 -2.61428450e-01 -1.45943101e-02 1.00294985e+00 3.65292109e-02] [ 1.25001740e-01 3.44304592e-02 4.66653027e-02 -6.34048877e-02 2.30349368e-01 5.72075893e-02 -6.17250714e-02 1.49589029e-02 -2.82382975e-02 3.65292109e-02 1.00294985e+00]]
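The diagonal entries read 1.00295 rather than exactly 1: `zscore` standardizes with `ddof=0` (population standard deviation) while `np.cov` defaults to `ddof=1`, so each diagonal equals n/(n-1) = 340/339 ≈ 1.00295. A quick check:

```python
import numpy as np
from scipy.stats import zscore

n = 340
rng = np.random.default_rng(42)
x = zscore(rng.normal(size=n))  # ddof=0 standardization, as in the notebook

# np.cov uses ddof=1 by default, so the variance of a z-scored column is n/(n-1)
print(float(np.cov(x)))  # ~ 1.00295
```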
# Initialize PCA with 10 components and fit model to scaled data
pca = PCA(n_components=10)
pca.fit(df_scaled)
PCA(n_components=10)
# PCA results
# Print the eigenVALUES associated with each principal component, which indicate the variance captured by each component
print(pca.explained_variance_)
[2.51212401 1.58411327 1.33434149 1.11310406 1.09308721 0.95662194 0.82731447 0.71754029 0.4845015 0.2833637 ]
# Print the eigenVECTORS (principal components), which define the directions of maximum variance in the data
print(pca.components_)
[[ 0.18335823 0.27666842 -0.42919618 -0.30106508 0.08417114 0.03368814 0.51091003 0.5033668 0.22120756 -0.20815091 0.00991673] [ 0.5774047 0.18815749 0.04101083 -0.06258117 0.32773726 0.1280034 -0.25155831 0.18433241 -0.36327549 0.37736302 0.360796 ] [-0.17770732 0.04646452 0.18283759 0.00473711 0.53919175 0.1855024 0.20918754 -0.27531285 0.58166569 0.24934832 0.2959799 ] [-0.00900284 0.65971773 -0.38664453 0.53431652 0.0838151 -0.13655589 -0.12101272 -0.26011429 0.02333621 -0.00303257 -0.14551133] [-0.39287414 0.12202474 -0.14571667 -0.12446023 0.1588855 0.56275703 -0.17221048 -0.06902206 -0.32807051 -0.47228787 0.29493986] [ 0.04539895 0.07374983 -0.01301984 -0.04924358 -0.1433002 0.69439441 0.07660902 -0.02703535 0.00607268 0.39241333 -0.571456 ] [ 0.34597383 -0.32918961 0.06740711 0.69551664 -0.04384053 0.31995867 0.15598116 0.16995766 0.13220581 -0.30317218 0.12940702] [ 0.03785302 -0.10485332 0.17805858 0.00977663 0.69685159 -0.13119433 -0.09410062 0.13610016 -0.12727392 -0.29100372 -0.56982313] [-0.05374498 0.54680565 0.75429579 0.04388306 -0.18199001 0.01575891 0.16200469 0.20518699 -0.03660507 -0.15469563 0.01551126] [ 0.47507303 0.10833734 0.05038953 -0.32004423 -0.14366391 0.09802191 -0.43017307 -0.28501031 0.45036866 -0.39076807 -0.07444627]]
# Print the proportion of variance explained by each component
print(pca.explained_variance_ratio_)
[0.22770322 0.14358674 0.120947 0.10089366 0.09907929 0.08670985 0.0749892 0.06503908 0.04391604 0.02568457]
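Given these ratios, the number of components needed for a target variance coverage can be read off the cumulative sum. A small sketch using the ratios printed above (hard-coded here for illustration; `n_components_for` is a hypothetical helper):

```python
import numpy as np

# explained-variance ratios from the fitted PCA, copied from the output above
ratios = np.array([0.22770322, 0.14358674, 0.120947, 0.10089366, 0.09907929,
                   0.08670985, 0.0749892, 0.06503908, 0.04391604, 0.02568457])

def n_components_for(threshold, ratios):
    """Smallest number of leading components whose cumulative ratio meets the threshold."""
    return int(np.argmax(np.cumsum(ratios) >= threshold) + 1)

print(n_components_for(0.90, ratios))  # 8 of 10 components are needed for 90% coverage
```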
# Create a bar chart showing the proportion of variance explained by each principal component
plt.bar(list(range(1,11)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.title("Individual Explained Variance Visualization")
plt.ylabel('Variation explained')
plt.xlabel('Principal Component')
plt.show()
# Create a step plot showing the cumulative proportion of variance explained by the components
plt.step(list(range(1,11)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.title("Cumulative Explained Variance Visualization")
plt.ylabel('Cum of variation explained')
plt.xlabel('Number of Components')
plt.show()
Observations:
This helps show how many components are needed to explain most of the variance, and here it appears nearly all of them are. Each component adds some information, so almost all ten are required to capture most of the variance in the dataset.
Note: reducing the dimensionality of the data is generally advised for supervised learning, where there is a target variable. For unsupervised learning, it is outliers, data points that can distort the clustering, that are usually candidates for exclusion from the dataset. In this case we are keeping them, but those outliers are what we would target, as opposed to reducing the number of variables altogether.
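To see which observations the note refers to, rows with extreme standardized values can be flagged; |z| > 3 is a common convention (the threshold and the toy data here are illustrative, not from the notebook):

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

# toy numeric frame standing in for df_num, with one extreme ROE-like value
rng = np.random.default_rng(0)
toy = pd.DataFrame({"roe": rng.normal(15, 5, size=300)})
toy.loc[0, "roe"] = 500  # an extreme value, akin to ROE's max of 917 in the data

# flag any row whose standardized value exceeds 3 in absolute terms
z = toy.apply(zscore)
outlier_rows = toy[(z.abs() > 3).any(axis=1)]
print(len(outlier_rows))
```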
#print average distortions for range of kmeans models fitted to scaled dataset
clusters = range(1, 11)
meanDistortions = []
for k in clusters:
model = KMeans(n_clusters=k)
model.fit(df_scaled)
prediction = model.predict(df_scaled)
distortion = (
sum(
np.min(cdist(df_scaled, model.cluster_centers_, "euclidean"), axis=1)
)
/ df_scaled.shape[0]
)
meanDistortions.append(distortion)
print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
Number of Clusters: 1 Average Distortion: 2.5425069919221697 Number of Clusters: 2 Average Distortion: 2.4359794966724415 Number of Clusters: 3 Average Distortion: 2.3666102174035797 Number of Clusters: 4 Average Distortion: 2.2150707100717115 Number of Clusters: 5 Average Distortion: 2.148999515589566 Number of Clusters: 6 Average Distortion: 2.0986142664921537 Number of Clusters: 7 Average Distortion: 2.037523546710418 Number of Clusters: 8 Average Distortion: 1.993329796710798 Number of Clusters: 9 Average Distortion: 1.9053944614353846 Number of Clusters: 10 Average Distortion: 1.8673989349526698
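As an aside, the distortion loop above recomputes distances with `cdist`; a fitted `KMeans` already exposes the within-cluster sum of squared distances as `inertia_`, which yields an equivalent elbow curve. A sketch on toy blobs (not the stock data):

```python
import numpy as np
from sklearn.cluster import KMeans

# three well-separated 2-D blobs
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in (0, 5, 10)])

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to the nearest center

print([round(i, 1) for i in inertias])  # the drop flattens sharply after k=3
```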
#fit KMeans model and use visualizer to indicate optimal K value
model = KMeans(random_state=42)
visualizer = KElbowVisualizer(model, k=(1, 11), timings=True)
visualizer.fit(df_scaled) # fit the data to the visualizer
visualizer.show() # finalize and render figure
plt.show()
#fit KMeans model and provide silhouette scores for range of k clusters
sil_score = []
cluster_list = range(2, 11)
for n_clusters in cluster_list:
clusterer = KMeans(n_clusters=n_clusters, random_state=42)
preds = clusterer.fit_predict((df_scaled))
score = silhouette_score(df_scaled, preds)
sil_score.append(score)
    print("For n_clusters = {}, the silhouette score is {}".format(n_clusters, score))
#show scores in line graph
plt.plot(cluster_list, sil_score)
plt.show()
For n_clusters = 2, the silhouette score is 0.4484662124740991 For n_clusters = 3, the silhouette score is 0.4419662311474576 For n_clusters = 4, the silhouette score is 0.45434371948348606 For n_clusters = 5, the silhouette score is 0.4670141851729817 For n_clusters = 6, the silhouette score is 0.16609442972204497 For n_clusters = 7, the silhouette score is 0.16010466610937854 For n_clusters = 8, the silhouette score is 0.3811731688385819 For n_clusters = 9, the silhouette score is 0.16110839502412727 For n_clusters = 10, the silhouette score is 0.15641849202440433
#fit KMeans model and use visualizer to indicate the optimal k value
model = KMeans(random_state=42)
visualizer = KElbowVisualizer(model, k=(2, 11), metric="silhouette", timings=True)
visualizer.fit(df_scaled) # fit the data to the visualizer
visualizer.show() # finalize and render figure
plt.show()
#find the optimal number of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(n_clusters=5, random_state=42))
visualizer.fit(df_scaled)
visualizer.show()
plt.show()
Between the Elbow and Silhouette plots, the number of clusters with the best performance appears to be 5.
# Initialize the KMeans with k=4
model4 = KMeans(n_clusters=4, random_state=1)
# Initialize the SilhouetteVisualizer with the KMeans model
visualizer = SilhouetteVisualizer(model4, colors='yellowbrick')
# Fit the visualizer to your scaled data
visualizer.fit(df_scaled)
# Display the silhouette plot
visualizer.show()
<Axes: title={'center': 'Silhouette Plot of KMeans Clustering for 340 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
The k=4 model produces reasonably distinct and well-separated groupings, but the highest silhouette score is observed for n_clusters = 5 (0.467). So, we investigate k=5 next.
k-Means Model with k=5
# Initialize KMeans with 5 clusters
kmeans5_model = KMeans(n_clusters=5, random_state=42)
# Fit KMeans model
kmeans5_model.fit(df_scaled)
# Model predicts the cluster for each data point in the scaled data
prediction=kmeans5_model.predict(df_scaled)
# Assign cluster labels to DataFrames
# Add a new column to the original dataframe and assigns the cluster labels to it
df["KMeans_Cluster"] = prediction
df_scaled["KMeans_Cluster"] = prediction # do it to the scaled dataframe as well
print("Groups Assigned : \n")
df.head()
Groups Assigned :
| | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | KMeans_Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 2 |
| 1 | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 2 |
| 2 | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 | 2 |
| 3 | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 | 2 |
| 4 | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 | 4 |
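Once labels are assigned, a quick check of cluster sizes helps interpret the groups before profiling them. A minimal sketch on synthetic data (`make_blobs` and `df_demo` stand in for the notebook's scaled features and DataFrame):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic stand-in for the 340-stock scaled feature matrix
X, _ = make_blobs(n_samples=340, centers=5, random_state=42)
df_demo = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])

# fit and assign labels, mirroring the notebook's KMeans_Cluster column
km = KMeans(n_clusters=5, random_state=42, n_init=10)
df_demo["KMeans_Cluster"] = km.fit_predict(df_demo)

# cluster sizes -- a quick sanity check before profiling the groups
print(df_demo["KMeans_Cluster"].value_counts())
```

Very uneven sizes (one dominant cluster plus tiny ones) are worth noting before drawing conclusions from per-cluster statistics.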
# Create boxplots for each feature within each cluster
df_scaled.boxplot(by='KMeans_Cluster', layout=(4, 3), figsize=(30, 30))
plt.show()
Observations:
Almost all clusters have medians close to zero, indicating consistency within these clusters.
Cash Ratio: Cluster 4 has a slightly higher median cash ratio. Outliers are present in clusters 0 and 4, suggesting a few companies with high cash ratios.
Current Price: Cluster 4 has a higher median price and more variability. Outliers are present in most clusters, particularly clusters 0 and 4, indicating some stocks with higher prices.
Earnings Per Share: Clusters 3 and 4 have a wider interquartile range, indicating more variability within those clusters.
Estimated Shares Outstanding: Cluster 4 shows a wider interquartile range. Outliers are present in cluster 0, indicating some companies with a high number of shares outstanding.
Net Cash Flow: Clusters 0 and 1 show wider distributions with the median around zero. Clusters 2, 3, and 4 have tighter distributions. There are outliers in clusters indicating some companies with either high or low net cash flow.
Net Income: Cluster 0 has a wider distribution, and cluster 4 shows some high-income outliers.
P/B Ratio: The price-to-book ratio across all clusters is centered around zero, with cluster 4 having the widest interquartile range and outliers, indicating more variability in the P/B ratio within this cluster.
P/E Ratio: The price-to-earnings ratio shows some outliers, indicating a few companies with unusually high or low P/E ratios.
Price Change: All clusters have medians around zero with similar stock price changes.
ROE: The return on equity is centered around zero for all clusters.
Volatility: Cluster 3 shows a higher median volatility and a wider spread, indicating this cluster contains stocks that are more volatile.
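The spreads described above can also be summarized numerically with a grouped quantile table. A minimal sketch, with synthetic data and labels standing in for `df_scaled` and its `KMeans_Cluster` column:

```python
import pandas as pd
from sklearn.datasets import make_blobs

# synthetic stand-in for the scaled features plus cluster labels
X, y = make_blobs(n_samples=60, centers=3, random_state=42)
demo = pd.DataFrame(X, columns=["f0", "f1"])
demo["KMeans_Cluster"] = y

# numeric counterpart of the boxplots: per-cluster quartiles of each feature
summary = demo.groupby("KMeans_Cluster").quantile([0.25, 0.5, 0.75])
print(summary)
```

The resulting table (one row per cluster-quantile pair) makes it easy to verify statements such as "cluster 3 has a wider interquartile range" directly from the numbers.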
# Generate a color map with a unique color for each bar
num_of_bars = len(df_scaled.columns) - 1 # Subtract 1 for the GROUP column
cm = plt.get_cmap('BuPu_r')
colors = [cm(1. * i / num_of_bars) for i in range(num_of_bars)]
# Plot the bar chart with unique colors
grouped_means = df_scaled.groupby("KMeans_Cluster").mean()
grouped_means.plot.bar(figsize=(15, 8), color=colors)
plt.title("Mean Feature Values by KMeans Cluster")
plt.show()
Observations:
This chart visualizes the mean value of each feature in the scaled dataset, grouped by KMeans cluster.
Cluster 2: low variability within this group suggests that its data points are quite similar to each other across all features.
Clusters 0 and 3: show high variability, capturing a wider range of characteristics within their members.
Clusters 1 and 4: show mostly positive mean values.
Current Price stands out in Cluster 4, indicating it is a defining feature of this group.
Some features, such as P/E Ratio and P/B Ratio, appear to follow similar trends.
# Create a new DataFrame without the 'KMeans_Cluster' labels
df_scaled_hclust = df_scaled.drop(columns=['KMeans_Cluster'], errors='ignore')
df_scaled_hclust.head()
| | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 |
Calculate the Cophenetic Correlation Coefficient (CCC)
# Step 1: Exploring Different Distance Metrics and Linkage Methods
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
linkage_methods = ["single", "complete", "average", "weighted"] # "centroid" and "ward" require Euclidean distance
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
    for lm in linkage_methods:
        Z = linkage(df_scaled_hclust, metric=dm, method=lm)
        c, coph_dists = cophenet(Z, pdist(df_scaled_hclust))
        print(f"Cophenetic correlation for {dm} distance and {lm} linkage is {c}.")
        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm = [dm, lm]
print(f"\n Highest CCC is: {high_cophenet_corr}, obtained with {high_dm_lm[0]} distance and {high_dm_lm[1]} linkage.")
Cophenetic correlation for euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for euclidean distance and weighted linkage is 0.8693784298129404.
Cophenetic correlation for chebyshev distance and single linkage is 0.9062538164750717.
Cophenetic correlation for chebyshev distance and complete linkage is 0.598891419111242.
Cophenetic correlation for chebyshev distance and average linkage is 0.9338265528030499.
Cophenetic correlation for chebyshev distance and weighted linkage is 0.9127355892367.
Cophenetic correlation for mahalanobis distance and single linkage is 0.925919553052459.
Cophenetic correlation for mahalanobis distance and complete linkage is 0.7925307202850002.
Cophenetic correlation for mahalanobis distance and average linkage is 0.9247324030159736.
Cophenetic correlation for mahalanobis distance and weighted linkage is 0.8708317490180428.
Cophenetic correlation for cityblock distance and single linkage is 0.9334186366528574.
Cophenetic correlation for cityblock distance and complete linkage is 0.7375328863205818.
Cophenetic correlation for cityblock distance and average linkage is 0.9302145048594667.
Cophenetic correlation for cityblock distance and weighted linkage is 0.731045513520281.
Highest CCC is: 0.9422540609560814, obtained with euclidean distance and average linkage.
# Step 2: Focusing on Euclidean Distance with the linkage methods including "centroid" and "ward"
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
for lm in linkage_methods:
    Z = linkage(df_scaled_hclust, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(df_scaled_hclust))
    print(f"Cophenetic correlation for Euclidean distance and {lm} linkage is {c}.")
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922.
Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672.
Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814.
Cophenetic correlation for Euclidean distance and centroid linkage is 0.9314012446828154.
Cophenetic correlation for Euclidean distance and ward linkage is 0.7101180299865353.
Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404.
# Step 3: Plotting Dendrograms for Different Linkage Methods
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
for i, method in enumerate(linkage_methods):
    Z = linkage(df_scaled_hclust, metric="euclidean", method=method)
    dendrogram(Z, ax=axs[i])
    axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
    coph_corr, coph_dist = cophenet(Z, pdist(df_scaled_hclust))
    axs[i].annotate(f"Cophenetic\nCorrelation\n{coph_corr:0.2f}", (0.80, 0.80), xycoords="axes fraction")
plt.tight_layout()
plt.show()
Observations:
Average linkage yields the highest cophenetic correlation (0.94), so it is used for the final hierarchical model.
# Use the linkage matrix from the 'average' method for dendrogram and cluster labels
Z = linkage(df_scaled_hclust, metric='euclidean', method='average')
# Plot the dendrogram
plt.figure(figsize=(12, 8))
dendrogram(Z)
# Find the distance at which the 6th-from-last merge occurs
# Cutting the dendrogram just above this height yields 6 clusters
threshold_distance = Z[-6, 2]  # the 6th-from-last merge
plt.title('Hierarchical Clustering Dendrogram (Average Linkage)')
plt.xlabel('Sample index')
plt.ylabel('Distance')
# Add a horizontal line to show where we cut the dendrogram to get 6 clusters
plt.axhline(y=threshold_distance, color='r', linestyle='--')
plt.text(x=plt.gca().get_xlim()[1], y=threshold_distance, s=' 6 clusters cut-off', verticalalignment='center', color='r')
plt.show()
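An equivalent way to obtain the six-cluster cut, without fitting a separate model, is to cut the linkage matrix directly with `scipy.cluster.hierarchy.fcluster`. A minimal sketch on synthetic data (`make_blobs` stands in for `df_scaled_hclust`):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# synthetic stand-in for the scaled stock-feature matrix
X, _ = make_blobs(n_samples=100, centers=6, random_state=42)

Z = linkage(X, metric="euclidean", method="average")
# cut the tree into exactly 6 flat clusters (fcluster labels are 1-based)
labels = fcluster(Z, t=6, criterion="maxclust")
print(np.unique(labels))
```

Because the labels come straight from the same linkage matrix used for the dendrogram, this cut is guaranteed to match the plotted threshold line.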
# Create the Agglomerative Clustering model with 6 clusters
hclust6_model = AgglomerativeClustering(n_clusters=6, metric='euclidean', linkage='average')
# Fit the model on scaled data
hclust6_model.fit(df_scaled_hclust)
AgglomerativeClustering(linkage='average', n_clusters=6)
# Retrieve the cluster labels from the fitted model and add to DataFrames
# Add the labels to the scaled data
df_scaled_hclust['H_Cluster'] = hclust6_model.labels_ #AgglomerativeClustering does not have a .predict() method
# Assign the labels to original dataframe for further analysis
df['H_Cluster'] = hclust6_model.labels_
print("Groups Assigned : \n")
df.head()
Groups Assigned :
| | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | KMeans_Cluster | H_Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 2 | 0 |
| 1 | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 2 | 0 |
| 2 | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 | 2 | 0 |
| 3 | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 | 2 | 0 |
| 4 | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 | 4 | 0 |
# Create boxplots for each feature within each cluster created with H Clust method
df_scaled_hclust.boxplot(by='H_Cluster', layout=(5, 3), figsize=(20, 30))
plt.show()
Observations:
Cluster 1 has many outliers across all features.
Cash Ratio: Most clusters have similar median cash ratios close to 0, with cluster 4 showing a higher median value. Cluster 1 displays outliers with higher cash ratios.
Current Price: The median current price across clusters is close to 0, with cluster 1 having outliers on the higher side.
Earnings Per Share (EPS): Cluster 3 has lower median EPS values compared to the other clusters and shows a wide range of EPS, indicating diverse companies within the cluster.
Estimated Shares Outstanding: Most clusters have similar medians around 0 for estimated shares outstanding. Cluster 2 has a notably wider box at a higher median than the rest. Cluster 1 has outliers with high share counts.
Net Cash Flow: Cluster 2 has a median much higher than 0 for net cash flow. Cluster 0 has outliers with both high and low net cash flow values.
Net Income: The median net income is around 0 for most clusters, with cluster 2 having a slightly higher and cluster 3 a lower median. Cluster 1 shows significant outliers, indicating both high and low net incomes.
Price-to-Book (P/B) Ratio: All clusters have a median P/B ratio around 0 except cluster 5, which has a higher ratio. Cluster 1 shows outliers indicating higher P/B ratios.
Price-to-Earnings (P/E) Ratio: The median P/E ratios are centered around 0 across clusters, with cluster 1 having outliers with high P/E ratios.
Price Change: Consistent across clusters with median values around 0, except cluster 2, which is lower. Cluster 1 has outliers indicating both large and small price changes.
Return on Equity (ROE): Cluster 3 has a notably higher median ROE than the other clusters, which have medians around 0, and shows a greater range, suggesting more variable ROE within the cluster. Cluster 1 also shows outliers with high ROE values.
Volatility: Cluster 3 has a higher median volatility. Cluster 1 shows greater variance and outliers, indicating it contains stocks with a wider range of volatility.
# Generate a color map with a unique color for each bar
num_of_bars = len(df_scaled_hclust.columns) - 1 # Subtract 1 for the GROUP column
cm = plt.get_cmap('BuPu_r')
colors = [cm(1. * i / num_of_bars) for i in range(num_of_bars)]
# Plot the bar chart with unique colors
grouped_means = df_scaled_hclust.groupby("H_Cluster").mean()
grouped_means.plot.bar(figsize=(15, 8), color=colors)
plt.title("Mean Feature Values by Hierarchical Cluster")
plt.show()
Cluster 0 is characterized by minimal variation across all financial metrics, suggesting a grouping of stocks with relatively stable and average financial characteristics.
Cluster 2 stands out with significant negative values in 'Net Income' and 'Earnings Per Share', indicating a cluster of stocks that may be underperforming in profitability.
Cluster 4 is notable for a high positive mean 'Current Price', possibly representing stocks with higher market valuations or stocks that have seen substantial price appreciation.
Execution Time : K-means and hierarchical clustering both executed rapidly, demonstrating that for this dataset, computational efficiency is high with either method.
Distinct Clusters : Both algorithms identified a major cluster with a significant portion of stocks, indicating a commonality in their characteristics. The smaller clusters suggest potential niche segments within the market.
Overlap in Clusters : Cross-tabulation revealed a significant overlap of 280 data points between the two methods, indicating substantial agreement in cluster assignments.
Recommended Clusters : K-Means suggested 5 clusters, while Hierarchical recommended 6, pointing to a slight difference in clustering structure.
Sector Analysis : K-Means uniquely clustered the Energy sector, which Hierarchical did not replicate, showing method-dependent differences in sector-based clustering.
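The overlap figure quoted above comes from cross-tabulating the two label columns. A minimal sketch, with hypothetical labels standing in for the notebook's `KMeans_Cluster` and `H_Cluster` columns:

```python
import pandas as pd

# hypothetical label columns; in the notebook these come from df
demo = pd.DataFrame({
    "KMeans_Cluster": [0, 0, 1, 1, 2, 2],
    "H_Cluster":      [0, 0, 1, 2, 2, 2],
})

# rows: k-means labels, columns: hierarchical labels;
# counts concentrated in a few cells indicate agreement between the methods
ct = pd.crosstab(demo["KMeans_Cluster"], demo["H_Cluster"])
print(ct)
```

Each cell counts the stocks assigned to a given (k-means, hierarchical) label pair; summing the largest cell in each row gives the number of points on which the two methods effectively agree.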
Client-Centric Portfolios : Trade&Ahead should align the investment recommendations with the client's specific financial goals and risk preferences, possibly using the identified clusters as portfolio frameworks.
Sector-Specific Investments : The unique clustering of the Energy sector in K-Means indicates that it may follow different market dynamics compared to other sectors. Investors could look at this as an opportunity for sector-specific investment strategies or for diversification purposes.
Algorithm Refinement : The difference in cluster suggestions from K-Means and Hierarchical clustering methods indicates a need for algorithm refinement. Combining insights from both methods could lead to a more robust clustering solution.
Product Development : Segmentation might reflect varying technological advancement levels within subindustries. Firms can focus on product development initiatives that cater to the specific trends and customer demands within their segment.